NSF PAR Search | NSF Public Access Repository

Accelerating Hybrid Quantized Neural Networks on Multi-tenant Cloud FPGA

https://doi.org/10.1109/ICCD56317.2022.00079

Kwadjo, Danielle Tchuinkou; Nghonda Tchinda, Erman; Mbongue, Joel Mandebi; Bobda, Christophe (October 2022, IEEE)

Full Text Available
Coarse-Grained Floorplanning for streaming CNN applications on Multi-Die FPGAs

https://doi.org/10.1109/ISPDC55340.2022.00014

Kwadjo, Danielle Tchuinkou; Tchinda, Erman Nghonda; Bobda, Christophe (July 2022, 21st International Symposium on Parallel and Distributed Computing (ISPDC))

Full Text Available
Deploying Multi-tenant FPGAs within Linux-based Cloud Infrastructure

https://doi.org/10.1145/3474058

Mbongue, Joel Mandebi; Kwadjo, Danielle Tchuinkou; Shuping, Alex; Bobda, Christophe (June 2022, ACM Transactions on Reconfigurable Technology and Systems)

Cloud deployments increasingly exploit Field-Programmable Gate Array (FPGA) accelerators as part of virtual instances. While cloud FPGAs are still essentially single-tenant, the growing demand for efficient hardware acceleration paves the way to FPGA multi-tenancy. It then becomes necessary to explore architectures, design flows, and resource management features that aim at exposing multi-tenant FPGAs to cloud users. In this article, we discuss a hardware/software architecture that supports provisioning space-shared FPGAs in Kernel-based Virtual Machine (KVM) clouds. The proposed hardware/software architecture introduces an FPGA organization that improves hardware consolidation and supports hardware elasticity with minimal data movement overhead. It also relies on VirtIO to decrease communication latency between hardware and software domains. Prototyping the proposed architecture with a Virtex UltraScale+ FPGA demonstrated near-specification maximum frequency for on-chip data movement and high throughput in virtual instance access to hardware accelerators. We demonstrate performance similar to single-tenant deployment while increasing FPGA utilization, which is one of the goals of virtualization. Overall, our FPGA design achieved about 2× higher maximum frequency than the state of the art and a bandwidth reaching up to 28 Gbps on a 32-bit data width.
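As a rough sanity check on the bandwidth figure, the sketch below relates link bandwidth to data width and clock rate. It assumes one data word is transferred per clock cycle, which the abstract does not state; the function name `implied_clock_mhz` is hypothetical and only illustrates the arithmetic.

```python
def implied_clock_mhz(bandwidth_gbps: float, width_bits: int) -> float:
    """Clock rate (MHz) needed to sustain a bandwidth on a bus of the given
    width, ASSUMING one word moves per cycle (not stated in the abstract)."""
    bits_per_second = bandwidth_gbps * 1e9
    return bits_per_second / width_bits / 1e6


# Figures quoted in the abstract: up to 28 Gbps on a 32-bit data width.
clock_mhz = implied_clock_mhz(28, 32)
print(f"Implied clock under the one-word-per-cycle assumption: {clock_mhz} MHz")
```

Under that assumption, 28 Gbps over 32 bits corresponds to 875 MHz. If the design moves more or less than one word per cycle, the implied clock changes accordingly.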
Full Text Available
Exploring a Layer-based Pre-implemented Flow for Mapping CNN on FPGA

https://doi.org/10.1109/IPDPSW52791.2021.00025

Kwadjo, Danielle Tchuinkou; Mbongue, Joel Mandebi; Bobda, Christophe (June 2021, 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW))
Convolutional Neural Networks (CNNs) are compute-intensive learning models that have demonstrated ability and effectiveness in solving complex learning problems. However, developing a high-performance FPGA accelerator for a CNN often demands high programming skills, hardware verification, precise distribution localization, and long development cycles. Besides, CNN depth increases through reuse and replication of multiple layers. This paper proposes a programming flow for CNNs on FPGA that generates high-performance accelerators by assembling pre-implemented CNN components like a puzzle, based on the graph topology. Using pre-implemented components allows us to use the minimum resources necessary, predict the performance, and gain productivity, since there is no need to synthesize any HDL code. Furthermore, components can be reused across a range of different applications. Through prototyping, we demonstrated the viability and relevance of our approach. Experiments show a productivity improvement of up to 69% compared to a traditional FPGA implementation while achieving over 1.75× higher Fmax with lower resource usage and power consumption.
Full Text Available
Performance Exploration on Pre-implemented CNN Hardware Accelerator on FPGA

https://doi.org/10.1109/ICFPT51103.2020.00055

Kwadjo, Danielle Tchuinkou; Mbongue, Joel Mandebi; Bobda, Christophe (May 2021, 2020 International Conference on Field-Programmable Technology (ICFPT))
As the complexity of FPGA architectures increases, there is a rising need for improved productivity and performance in several computing domains such as image processing, financial analytics, edge computing, and deep learning. However, vendor tools are mostly general-purpose, as they attempt to provide an acceptable quality of result (QoR) on a broad set of applications, and so may not exploit application/domain-specific characteristics to deliver higher QoR. In this paper, we present a divide-and-conquer design flow that enables application/domain-specific optimization in the design of convolutional neural network (CNN) architectures on Xilinx FPGAs. The proposed approach follows three fundamental steps: Step 1: break the design down into components; Step 2: implement these separate components; and Step 3: efficiently generate the final design by assembling the pre-built components with minimal QoR loss. Recent research has even demonstrated that such approaches may provide better QoR than the traditional Vivado flow in some instances [1], [2]. By pre-implementing specific components of a design, higher performance can be achieved locally and maintained to a certain extent when assembling the final circuit. This approach is supported by two main observations [1]: (1) vendor tools such as Vivado tend to deliver high-performance results on small modules in a design; (2) computing applications such as machine learning designs increase in size by replicating modules. CNN inference refers to the forward propagation of M input images through L layers. The repetition of components within CNN architectures makes them suitable candidates for RapidWright implementation, as the CNN sub-modules can be optimized for performance in isolation, and the achieved performance can be preserved when replicating and relocating the modules across the FPGA.
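The three steps above can be illustrated with a small bookkeeping model. This is only a sketch: it does not use RapidWright or Vivado, every name in it (`PrebuiltComponent`, `decompose`, `pre_implement`, `assemble`) is hypothetical, and the Fmax and LUT numbers are made-up placeholders rather than results from the paper. It also assumes that the assembled design can run no faster than its slowest component, ignoring inter-component routing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrebuiltComponent:
    kind: str        # layer type, e.g. "conv"
    fmax_mhz: float  # Fmax reached when implemented standalone (placeholder)
    luts: int        # LUTs used by one instance (placeholder)


def decompose(layers):
    """Step 1: break the design down into its unique component kinds."""
    return sorted(set(layers))


def pre_implement(kinds, characterization):
    """Step 2: implement each component kind once, standalone.
    Here this is just a lookup into placeholder characterization data."""
    return {k: PrebuiltComponent(k, *characterization[k]) for k in kinds}


def assemble(layers, library):
    """Step 3: build the final design by replicating pre-built components.
    ASSUMPTION: the slowest component bounds the assembled design's Fmax."""
    placed = [library[k] for k in layers]
    return {
        "instances": len(placed),
        "unique_components": len(library),
        "luts": sum(c.luts for c in placed),
        "fmax_upper_bound_mhz": min(c.fmax_mhz for c in placed),
    }


# Illustrative CNN with repeated layers; all numbers are invented.
layers = ["conv", "pool", "conv", "pool", "fc"]
characterization = {"conv": (500.0, 1200), "pool": (600.0, 300), "fc": (450.0, 800)}

library = pre_implement(decompose(layers), characterization)
design = assemble(layers, library)
print(design)
```

The point of the model is that five layer instances require only three standalone implementations, which is where the replication of modules in CNNs pays off.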
Full Text Available
